MultiLoc2 predicts subcellular localizations for a given protein sequence. We would like to run the local version of this code over every human protein sequence, so that, to be safe, predictions are available for any protein we might need.
To do this, we first need FASTA sequences of all human protein sequences.
In a previous notebook all associated protein sequences for each Entrez ID were retrieved. We require canonical sequences for each Entrez ID. These can be found using the mapping files on the iRefIndex website. See this page for information on how iRefIndex handles canonicalisation.
In [1]:
cd ../../iRefIndex/
In [2]:
!head -n 3 mappings.txt
Columns 2 and 3 contain the external identifier and the Entrez Gene ID. We would like to match the human Gene IDs to RefSeq IDs. An easy way to do this is to download the RefSeq protein FASTA entries for all human genes and then map these back to the Entrez Gene IDs using this table.
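To double-check which fields hold which identifiers, a quick sketch like the following can print the first few rows split into columns. This assumes mappings.txt is tab-delimited, as the parsing code further down also does:

# peek at the first few rows of mappings.txt split into tab-delimited fields,
# to check which columns hold the external (RefSeq) identifier and the Entrez Gene ID
import csv

with open("mappings.txt") as f:
    reader = csv.reader(f, delimiter="\t")
    for i, row in enumerate(reader):
        if i >= 3:
            break
        for j, field in enumerate(row):
            print "column {0}: {1}".format(j, field)
        print "---"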
In [4]:
!wget ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.protein.faa.gz
In [5]:
!gunzip human.protein.faa.gz
We can parse this FASTA file using Biopython:
In [4]:
from Bio import SeqIO
In [5]:
proteinrecords = list(SeqIO.parse("human.protein.faa","fasta"))
Printing an example entry to check:
In [6]:
# the record IDs appear to be pipe-delimited NCBI headers (e.g. gi|<number>|ref|<RefSeq accession>|),
# so field 3 should hold the RefSeq accession
print proteinrecords[0].id.split("|")[3]
In [7]:
refseqtoentry = {}
for r in proteinrecords:
    try:
        refseqtoentry[r.id.split("|")[3]] += [r]
    except KeyError:
        refseqtoentry[r.id.split("|")[3]] = [r]
Checking that each ID is unique, as expected for RefSeq IDs:
In [8]:
ridlengths = []
for k in refseqtoentry.keys():
    if len(refseqtoentry[k]) != 1:
        print "Entry {0} has length {1}".format(k,len(refseqtoentry[k]))
So everything is OK. Next, we take these RefSeq IDs and map them onto Entrez IDs:
In [10]:
import csv
In [17]:
#prime the dictionary
refseqtoentrez = {}
for k in refseqtoentry.keys():
    refseqtoentrez[k] = []
In [18]:
f = open("mappings.txt")
c = csv.reader(f,delimiter="\t")
lc = 0
for l in c:
try:
refseqtoentrez[l[1]] += [l[3]]
except KeyError:
#then it's not a human refseq
pass
if lc%10000000 == 0:
print "Reached line {0}".format(lc)
lc += 1
f.close()
In [19]:
print "{0} Refseq IDs mapped to {1} Entrez IDs".format(len(refseqtoentrez.keys()),
len(list(flatten(refseqtoentrez.values()))))
Check that this mapping is 1 to 1:
In [21]:
ridlengths = []
for k in refseqtoentrez.keys():
    if len(refseqtoentrez[k]) > 1:
        print "Entry {0} has length {1}".format(k,len(refseqtoentrez[k]))
So all of the RefSeq IDs with a non-empty mapping map precisely 1 to 1 to an Entrez ID. Those Entrez IDs are the proteins we would like to run through the MultiLoc2 script. We can now iterate over these dictionaries and build a new dictionary relating each Entrez ID to its corresponding canonical sequence:
In [12]:
entreztorecord = {}
for k in refseqtoentrez.keys():
    if refseqtoentrez[k]:
        entreztorecord[refseqtoentrez[k][0]] = refseqtoentry[k][0]
In [12]:
import pickle
In [23]:
# entreztorecord values are already individual SeqRecord objects, so no flattening is needed
SeqIO.write(entreztorecord.values(),"human.canonical.refseq.fasta","fasta")
Out[23]:
In [16]:
f = open("human.canonical.entrez.pickle","wb")
pickle.dump(entreztorecord,f)
f.close()
The dictionary mapping RefSeq IDs 1 to 1 to Entrez IDs could also come in useful, so that should be stored as well:
In [25]:
f = open("human.canonical.refseqtoentrez.pickle","wb")
pickle.dump(refseqtoentrez,f)
f.close()
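For later notebooks, a minimal sketch of loading these pickles back (assuming the same working directory) would look something like:

import pickle

# reload the Entrez ID -> canonical SeqRecord dictionary
with open("human.canonical.entrez.pickle", "rb") as f:
    entreztorecord = pickle.load(f)

# reload the RefSeq ID -> Entrez ID mapping
with open("human.canonical.refseqtoentrez.pickle", "rb") as f:
    refseqtoentrez = pickle.load(f)

print "Loaded {0} canonical records".format(len(entreztorecord))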
To install MultiLoc2, I downloaded the tar file from here and installed the dependencies through the AUR.
Then I ran the configuration script:
In [36]:
cd ../multiloc2/MultiLoc2-26-10-2009/
In [37]:
!python2 configureML2.py
First, testing the script on just a few entries:
In [38]:
!python2 src/multiloc2_prediction.py
In [41]:
#writing a small FASTA file here:
testlist = entreztorecord.values()[0:10]
SeqIO.write(testlist,"human.testlist.fasta","fasta")
Out[41]:
In [43]:
!python2 src/multiloc2_prediction.py -fasta=human.testlist.fasta -origin=animal -result=human.testlist.multiloc2
Running in parallel may be faster:
In [45]:
#write 10 small files:
for x,r in zip(range(10),testlist):
    SeqIO.write([r],"human.{0}.fasta".format(x),"fasta")
In [68]:
#write a script to execute all of the files in parallel:
f = open("runmultiloc2.sh", "w")
f.write(r"#!/bin/bash")
f.write("\n")
for x in range(10):
    f.write("python2 src/multiloc2_prediction.py -fasta=human.{0}.fasta -origin=animal -result=human.{0}.multiloc2".format(x))
    f.write(" & \n")
f.write("wait\n")
f.write("echo complete")
f.close()
In [69]:
!cat runmultiloc2.sh
In [70]:
!chmod +x runmultiloc2.sh
In [ ]:
!./runmultiloc2.sh
Cleaning up:
In [73]:
!rm human.*
In [72]:
import time
In [77]:
testlist = entreztorecord.values()[0:1000]
SeqIO.write(testlist,"human.testlist.fasta","fasta")
Out[77]:
In [78]:
pre = time.time()
In [79]:
!python2 src/multiloc2_prediction.py -fasta=human.testlist.fasta -origin=animal -result=human.testlist.multiloc2
In [80]:
post = time.time()
As can be seen from the output above, the script hit an error, so the predictions were not actually produced.
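For reference, had the run completed, the elapsed time could be reported from the two timestamps above with a one-liner like this:

# report how long the MultiLoc2 run took, in seconds
print "Elapsed time: {0:.1f} seconds".format(post - pre)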
The ENTs paper uses results from the MultiLoc2 script for the same task we are working on: protein-protein interaction prediction. Instead of running MultiLoc2 ourselves, we can therefore just use the precomputed results in the file found in the ENTs standalone download here.
In [81]:
cd ../../ents/standalone/
In [82]:
ls
In [83]:
!head h_sapiens_subcellular_multiloc2.out
So what is this identifier they're using? An example from above is ENSP00000442112, which looks like a human Ensembl protein ID.
So we just need to map from our list of canonical human proteins to these Ensembl IDs.
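As a first step, a minimal sketch like the one below would collect the set of Ensembl protein IDs covered by the precomputed file. This assumes the identifier is the first whitespace-delimited field on each line, as it appears to be in the head output above:

# collect the Ensembl protein IDs present in the precomputed MultiLoc2 results;
# assumes the ID is the first whitespace-delimited field on each line
ensemblids = set()
with open("h_sapiens_subcellular_multiloc2.out") as f:
    for line in f:
        fields = line.split()
        if fields and fields[0].startswith("ENSP"):
            ensemblids.add(fields[0])

print "{0} Ensembl protein IDs in the precomputed results".format(len(ensemblids))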
We would like to use the same features used in this paper, as they appear to be effective, so we have to parse the files in the ENTs folder with a method similar to that used in the paper. To do this, we will have to inspect the code.
Opening a new notebook to repurpose the ENTs code into our pipeline.